Cluster Hypothesis in Low-Cost IR Evaluation with Different Document Representations

Authors

  • Kai Hui
  • Klaus Berberich
Abstract

Offline evaluation for information retrieval compares the performance of retrieval systems using relevance judgments for a set of test queries. Since manual judgments are expensive, selective labeling has been developed to label documents semi-automatically by exploiting the similarity relationships among retrieved documents. Intuitively, how well these similarities agree with the cluster hypothesis directly determines how many manual judgments can be saved by creating labels with a semi-automatic method. Representing documents, however, inevitably loses some information. We argue that a better document representation can lead to better agreement with the cluster hypothesis. To this end, we investigate different document representations on established benchmarks in the context of low-cost evaluation, showing that different document representations vary in how well they capture document similarity relative to a query.
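As a rough illustration of what "agreement with the cluster hypothesis" can mean in practice, the sketch below runs a simple nearest-neighbour check: for each relevant document, it measures what fraction of its k most similar documents are also relevant. This is not the authors' method. The bag-of-words representation, the cosine similarity, the function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nn_agreement(docs: dict, relevant: set, k: int = 1) -> float:
    """Mean fraction of relevant documents among the k nearest
    neighbours of each relevant document (hypothetical helper)."""
    # Assumed representation: lower-cased whitespace tokens, raw counts.
    vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
    scores = []
    for d in relevant:
        others = [o for o in vectors if o != d]
        others.sort(key=lambda o: cosine(vectors[d], vectors[o]), reverse=True)
        top = others[:k]
        if top:
            scores.append(sum(o in relevant for o in top) / len(top))
    return sum(scores) / len(scores) if scores else 0.0


# Toy retrieved documents for one query (invented for illustration).
docs = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle habitat",
    "d4": "jungle cat habitat prey",
}
relevant = {"d1", "d2"}

print(nn_agreement(docs, relevant, k=1))
```

Swapping in a different representation inside `nn_agreement` (for example, weighted terms instead of raw counts) and comparing the resulting scores is one way to see how the representation affects agreement; a higher score suggests that labels could be propagated between similar documents with fewer manual judgments.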


Similar resources

Personal Name Resolution of Web People Search

Disambiguating personal names in a set of documents (such as a set of web pages returned in response to a person name) is a difficult and challenging task. In this paper, we explore the extent to which the “cluster hypothesis” for this task holds (i.e., that similar documents tend to represent the same person). We explore two clustering techniques which used either (1) term based matching (sing...


Document Clustering Algorithms, Representations and Evaluation for Information Retrieval

Digital collections of data continue to grow exponentially as the information age continues to infiltrate every aspect of society. These sources of data take many different forms such as unstructured text on the world wide web, sensor data, images, video, sound, results of scientific experiments and customer profiles for marketing. Clustering is an unsupervised learning approach that groups sim...


Testing the cluster hypothesis in distributed information retrieval

How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together...


Document Clustering: Before and After the Singular Value Decomposition

Document Clustering is an issue of measuring similarity between documents and grouping similar documents together. Information Retrieval (IR) is an issue of comparing query with a collection of documents to locate a set of documents relevant to a particular query. In the vector space IR model, a query is treated as a document which consists of a few terms. Therefore, in both clustering and retr...


Semi-Structured Document Classification

INTRODUCTION Document classification has developed over the last ten years, using techniques originating from the pattern recognition and machine learning communities. All these methods operate on flat text representations in which word occurrences are considered independent. The recent paper (Sebastiani, 2002) gives a very good survey on textual document classification. With the development of st...




Publication date: 2016